

LAB2_REPORT

Experiment requirements

Learn how to perform AC and PZ (pole-zero) simulations on your circuit. Master the method of obtaining the zeros and poles of a circuit from the Bode plot and from PZ analysis. Analyze the difference between the experimental data and the calculated data. Strengthen your understanding of poles and zeros.

Implementation and optimization steps

1. First, study the code in dnn.cpp, dnn.h and dnn_test.cpp. dnn_test.cpp reads the previously obtained weight and offset matrices from file, reads the picture data from the test images, and then calls the dnn function. Modify the file paths so that dnn_test.cpp can open the files correctly. All data types used by the function are defined as float in dnn.h.
2. In dnn.cpp, write the initial code for calculating the label, following the instructions in the ppt.
3. Create a project with dnn.cpp as the top file and dnn_test.cpp as the testbench.
4. Run C simulation. The accuracy is 97 percent, which indicates that the code is written correctly.
5. Run synthesis. The interval is very large: 85930.
6. Modify dnn.h to change the floating-point type to ap_fixed<8,3,AP_RND,AP_SAT>. (This type was found by fixing the last two parameters and adjusting the first two until the accuracy stayed above 90% with the fewest bits.)
7. Synthesize again. The interval is now 33142.
8. Modify the code. First, I interchanged the inner and outer loops, so that when the inner loop is forcibly unrolled by pipelining, the interval becomes smaller, since the interval is closely related to the length of the unrolled loop. Second, I split some of the original loops into separate loops so that the operations are performed only in the innermost loop, which makes the loops perfect or semi-perfect as described in 01_Class_Intro_hls-2018-full.pdf.
9. Add directives: (1) I added dataflow to the function dnn, which allows loops to operate concurrently. (2) I used array partition to allow multiple elements of an array to be read simultaneously, so that the pipeline could be more effective. (3) I added the resource directive to the arrays to use ROM_2P_BRAM, which did not have any effect. (4) I added the pipeline directive to all the loops.
10. After steps 8 and 9, the interval decreased to 276, but the latency and the interval were very close to each other, which means that dataflow did not actually take effect, and the simulation was not correct (in C/RTL co-simulation, the latency and interval were both about 350).
11. Synthesize again, resulting in a latency of 335.
12. Add partitions to the hidden and output arrays. The interval obtained is 237, which is less than 250. However, the latency and the interval are still nearly the same, which means that dataflow does not work and the simulation result is incorrect.
13. Remove the dataflow directive. Since the resources are sufficient, all the arrays are partitioned directly, which makes pipelining easier. Modify the directives so that only the loop nest in which the two matrices are multiplied is kept as a loop, and the remaining loops are unrolled. Because the other loops are relatively small, this further increases the throughput.
14. Synthesize again. The interval is 245, which satisfies the requirement.
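
To make the effect of the ap_fixed<8,3,AP_RND,AP_SAT> type used above more concrete, the following is a minimal sketch in plain C++ that emulates storing a real value in that format. It does not use the Xilinx ap_fixed header; the function names quantize_8_3 and check_quantize_examples are hypothetical and exist only for this illustration. The sketch assumes the usual meaning of the parameters: 8 total bits, 3 integer bits including the sign (so 5 fractional bits and a step of 1/32), AP_RND rounding toward plus infinity, and AP_SAT clamping to the representable range.

```cpp
#include <cassert>
#include <cmath>

// Emulates the value obtained when x is stored as ap_fixed<8,3,AP_RND,AP_SAT>.
// Assumption: W=8 total bits, I=3 integer bits (sign included), so 5 fractional
// bits, a step of 1/32 and a representable range of [-4.0, 3.96875].
double quantize_8_3(double x) {
    const double scale = 32.0;                  // 2^5 for 5 fractional bits
    double code = std::floor(x * scale + 0.5);  // AP_RND: round half toward +infinity
    if (code > 127.0) code = 127.0;             // AP_SAT: clamp to the largest 8-bit code
    if (code < -128.0) code = -128.0;           // AP_SAT: clamp to the smallest 8-bit code
    return code / scale;
}

// A few hand-checked values (hypothetical inputs, not data from the lab).
void check_quantize_examples() {
    assert(quantize_8_3(0.5) == 0.5);        // exactly representable
    assert(quantize_8_3(0.1) == 0.09375);    // nearest step is 3/32
    assert(quantize_8_3(5.0) == 3.96875);    // saturates at the top of the range
    assert(quantize_8_3(-5.0) == -4.0);      // saturates at the bottom of the range
}
```

With only 5 fractional bits, weights smaller than about 1/64 in magnitude collapse to zero or to one step, which is one plausible reason why the accuracy drops when the bit width is reduced further.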

Resource analysis

Because I split the loop in which input_image is multiplied by w1, the total DSP usage is 74 = 32 + 32 + 10, which corresponds to the number of multipliers used.
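
The link between the unrolled loop length and the multiplier count can be illustrated with a small sketch of the loop interchange. This is not the code from dnn.cpp: the layer sizes N_IN and N_OUT and the function names are hypothetical, float is used instead of the fixed-point type so the block compiles without the HLS headers, and the pragmas are standard Vivado HLS directives that an ordinary C++ compiler ignores. With the inner loop fully unrolled, each pipelined outer iteration performs N_OUT multiplications in parallel, so one would expect roughly N_OUT multipliers for this loop; the actual DSP allocation depends on the tool, data type and device.

```cpp
#include <cassert>

// Hypothetical layer sizes for illustration only; the report does not state them.
const int N_IN = 8;
const int N_OUT = 4;

// Baseline order: the outer loop walks the outputs, the inner loop accumulates
// over the inputs.
void layer_baseline(float in[N_IN], float w[N_IN][N_OUT], float out[N_OUT]) {
    for (int j = 0; j < N_OUT; ++j) {
        float acc = 0.0f;
        for (int i = 0; i < N_IN; ++i) {
            acc += in[i] * w[i][j];
        }
        out[j] = acc;
    }
}

// Interchanged order: the outer loop walks the inputs and is pipelined; the
// inner loop over the outputs is unrolled, giving N_OUT parallel multiplies.
void layer_interchanged(float in[N_IN], float w[N_IN][N_OUT], float out[N_OUT]) {
#pragma HLS ARRAY_PARTITION variable=w complete dim=2
#pragma HLS ARRAY_PARTITION variable=out complete dim=1
    // Initialization is split into its own loop so the nest below stays perfect.
    for (int j = 0; j < N_OUT; ++j) {
#pragma HLS UNROLL
        out[j] = 0.0f;
    }
    for (int i = 0; i < N_IN; ++i) {
#pragma HLS PIPELINE II=1
        for (int j = 0; j < N_OUT; ++j) {
#pragma HLS UNROLL
            out[j] += in[i] * w[i][j];
        }
    }
}

// Checks that both loop orders produce the same result on small integer data.
void check_layer_orders_agree() {
    float in[N_IN], w[N_IN][N_OUT], a[N_OUT], b[N_OUT];
    for (int i = 0; i < N_IN; ++i) {
        in[i] = float(i + 1);
        for (int j = 0; j < N_OUT; ++j) w[i][j] = float(i - j);
    }
    layer_baseline(in, w, a);
    layer_interchanged(in, w, b);
    for (int j = 0; j < N_OUT; ++j) assert(a[j] == b[j]);
}
```

Both versions compute the same sums in the same order per output, so the interchange changes only the schedule, not the result.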

Experimental summary

The project's interval was reduced to 276 after rewriting the code and applying the dataflow, array partition and pipeline directives. The result shows that the HLS tool can sometimes produce incorrect results, so we should not rely solely on the simulation result when designing.